Lecture 2, CSE 474/574

Probability

The vast majority of these notes are derived from Casella and Berger, and from Dr. Chandola's notebooks.

You should also treat these as a complement to the Seeing Theory suggested readings that we will also discuss in class.

Basic Definitions

Given a sample space $S$ and an associated collection of events $\mathcal{B}$, a probability function is a function $P$ with domain $\mathcal{B}$ that satisfies:

  1. $P(A) \geq 0$ for all $A \in \mathcal{B}$
  2. $P(S) = 1$
  3. If $A_1, A_2, ... \in \mathcal{B}$ are pairwise disjoint, then $P(\bigcup^{\infty}_{i=1} A_i) = \sum_{i=1}^\infty P(A_i)$

Definition of a probability function

Let $S = \{s_1, \ldots, s_n\}$ be a finite sample space, and let $p_1, \ldots, p_n$ be nonnegative numbers that sum to one. If: $$P(A) = \sum_{\{i: s_i \in A\}} p_i$$

Then $P$ is a probability function (ignoring the stuff with $\mathcal{B}$ for now).
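Here is a minimal Python sketch of this definition (the outcomes and weights are made up for illustration):

```python
# A probability function on a finite sample space S = {s1, s2, s3}:
# P(A) is the sum of the weights p_i of the outcomes s_i that fall in A.
p = {"s1": 0.2, "s2": 0.5, "s3": 0.3}   # nonnegative, sums to one

def prob(A):
    """P(A) = sum of p_i over the s_i in the event A."""
    return sum(p[s] for s in A)

print(prob({"s1", "s3"}))   # 0.5
print(prob(set(p)))         # P(S) = 1.0, as required by the axioms
```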

Example: Coin tossing

Exercise: Prove that if $P(\{H\}) = P(\{T\})$, then $P(\{H\}) = P(\{T\}) = \frac{1}{2}$

Conditional Probability

If $A$ and $B$ are events in $S$, and $P(B) > 0$, then the conditional probability of $A$ given $B$, written as $P(A|B)$, is: $$P(A|B) = \frac{P(A \cap B)}{P(B)}$$ or equivalently, in random-variable notation, $$P(A=a|B=b) = \frac{P(A = a, B = b)}{P(B=b)}$$ The latter notation will be useful below.

Exercise Free sandwich to students. (Pg. 21-22)

Two events $A$ and $B$ are statistically independent if $P(A \cap B) = P(A)P(B)$. Naturally, if $P(A \cap B) \neq P(A)P(B)$, then the events are not independent. Machine learning is largely about leveraging these conditional relationships between variables.
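As a quick illustration in Python, here is the conditional-probability formula and the independence check applied to two events defined on two fair coin tosses (the events themselves are hypothetical choices, not from the notes):

```python
from itertools import product

# Two fair coin tosses; A = "first toss is H", B = "at least one H".
S = list(product("HT", repeat=2))      # 4 equally likely outcomes
P = {s: 1 / len(S) for s in S}

A = {s for s in S if s[0] == "H"}
B = {s for s in S if "H" in s}

def prob(E):
    return sum(P[s] for s in E)

print(prob(A & B) / prob(B))               # P(A|B) = P(A ∩ B) / P(B) = 2/3
print(prob(A & B) == prob(A) * prob(B))    # False: A and B are not independent
```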

Random Variables

A random variable is a function from a sample space $S$ into the real numbers.

Exercise: What is a random variable we could compute with:

Exercise Define the random variable $X$ to be the number of heads on 3 coin tosses.

  1. Enumerate for each point in the sample space the value of $X$. (pg. 28)
  2. Define $P_X(X=x)$ based on the induced probability function
  3. What is $P_X(X=1)$?
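A minimal Python sketch of this setup (assuming fair coins) enumerates the 8 sample points and builds the induced probability function:

```python
from itertools import product
from collections import defaultdict

# X = number of heads on 3 tosses of a fair coin.
S = list(product("HT", repeat=3))      # 8 equally likely sample points
P_X = defaultdict(float)
for s in S:
    x = s.count("H")                   # value of X at this sample point
    P_X[x] += 1 / len(S)               # induced probability function

print(dict(sorted(P_X.items())))       # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
print(P_X[1])                          # P_X(X = 1) = 0.375
```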

Distribution functions of a random variable

CDF

The cumulative distribution function, or cdf, of a random variable $X$ is defined as: $$F_X(x) = P_X(X\leq x),\mathrm{\, for\, all\,} x$$

Exercise: Draw the $cdf$ of $X$ when $X$ is the number of heads observed after tossing 3 fair coins.
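One way to sketch this in Python (assuming fair coins, so $X \sim \mathrm{Binomial}(3, 0.5)$) is to plot the step function with matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

x = np.arange(-1, 5)
F = binom.cdf(x, n=3, p=0.5)        # F_X(x) = P(X <= x)

plt.step(x, F, where="post")        # the cdf of a discrete RV jumps at 0, 1, 2, 3
plt.xlabel("x")
plt.ylabel("F_X(x)")
plt.show()
```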

PMF

The probability mass function (pmf) of a discrete random variable $X$ is given by: $$f_X(x) = P(X=x),\mathrm{\, for\, all\,} x$$

Exercise: Why is this only for discrete random variables?

Note: $$ P(X \leq b) = \sum_{k \leq b} f_X (k) = F_X(b)$$

PDF

The probability density function (pdf) of a continuous random variable $X$ is the function that satisfies: $$F_X(x) = \int_{-\infty}^x f_X(t) dt$$
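As a quick numerical check of this relationship (using a standard normal as the continuous random variable, an illustrative choice):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

x = 1.0
area, _ = quad(norm.pdf, -np.inf, x)   # integrate the pdf from -infinity to x
print(area, norm.cdf(x))               # both are approximately 0.8413
```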

Continuous vs. Discrete

A random variable $X$ is continuous if $F_X(x)$ is a continuous function of $x$. It is discrete if $F_X(x)$ is a step function of $x$.

Exercise: What are some examples of continuous variables? Of discrete ones?

Identically distributed RVs

Two random variables $X$ and $Y$ are identically distributed if $F_X(x) = F_Y(x)$ for all $x$

Exercise Define two identically distributed random variables that can be computed from a single coin tossed three times.

Expected Value of a (Function of a) Random Variable

First, a note for this section: If $X$ is an RV, then any function of $X$, say $g(X)$, is also a random variable. For generality, we're going to use $g(X)$ instead of just $X$ here.

The expected value, or expectation, of a random variable is simply the variable's average value, weighted by the probability of that value occurring.

Exercise: What is the expected value of the random variable $X$, the number of heads I get when I flip a coin 100 times?

Formally, if $X$ is continuous, then:

$$ \mathbb{E}\, g(X) = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx$$

And if $X$ is discrete, then:

$$ \mathbb{E}\, g(X) = \sum_{x \in \mathcal{X}} g(x)P(X=x)$$

A useful property of expectations is linearity: if $a, b$ are constants, then $\mathbb{E}(aX + b) = a \mathbb{E}(X) + b$.
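A minimal sketch in Python, computing $\mathbb{E}(X)$ and checking linearity for the 3-coin-toss variable from earlier (its pmf is assumed known here):

```python
import numpy as np

x = np.array([0, 1, 2, 3])            # support of X (number of heads in 3 tosses)
p = np.array([1, 3, 3, 1]) / 8        # pmf P(X = x)

EX = np.sum(x * p)                    # E[X] = 1.5
a, b = 2.0, 5.0
E_aXb = np.sum((a * x + b) * p)       # E[g(X)] with g(x) = a x + b
print(EX, E_aXb, a * EX + b)          # 1.5, 8.0, 8.0 -- linearity holds
```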

Variance of a Random Variable

The variance of a random variable $X$ is defined as: $$ \mathrm{Var}(X) = \mathbb{E}\left[(X - \mathbb{E}(X))^2\right] $$

The standard deviation is equal to $\sqrt{\mathrm{Var}(X)}$.

Both of these measures give you an idea of the spread of the distribution. That is, the larger the variance, the larger the ... variability ... in $X$.

Exercise What does it mean if $\mathrm{Var}(X) = 0$?
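And a matching sketch for the variance, again using the 3-coin-toss pmf (this just applies the definition directly; it is not an answer to the exercise):

```python
import numpy as np

x = np.array([0, 1, 2, 3])
p = np.array([1, 3, 3, 1]) / 8

EX = np.sum(x * p)
var = np.sum((x - EX) ** 2 * p)       # Var(X) = E[(X - E[X])^2] = 0.75
print(var, np.sqrt(var))              # variance and standard deviation
```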

Probability distributions for univariate models (One random variable)

A distribution is a simplified model of a population. Because it is useful to have the same model be able to study different populations, statistical distributions are families, rather than a single pre-specified form. A specific instantiation of that family is determined by the value of the parameter(s) for that distribution.

Probability distributions for discrete random variables

A random variable is generated from a probability distribution. There are different types of distributions defined for discrete random variables. These include:

Bernoulli distribution

The Bernoulli distribution represents a binary random variable that takes value 1 with success probability $p$ and value 0 with failure probability $q=1-p$. A Bernoulli distribution has only one parameter: $p$.

Binomial distribution

Another popular distribution for a discrete random variable is the binomial distribution. A binomial distribution has two parameters $n$ and $\theta$, where $0 \le \theta \le 1$. The sample generated by a binomial distribution denotes the number of successes observed in a sequence of $n$ binary trials (e.g., toss of a coin) when the probability of each success is $\theta$.

The samples that are drawn from a binomial distribution range between 0 and $n$.

The probability distribution is defined as: \begin{equation} p(k;n,\theta) = P(X = k) = \binom{n}{k}\theta^k (1 - \theta)^{n-k} \end{equation}
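A small sketch evaluating this pmf with scipy (the values of $n$ and $\theta$ are arbitrary):

```python
from scipy.stats import binom

n, theta = 10, 0.3
for k in [0, 3, 10]:
    print(k, binom.pmf(k, n, theta))            # p(k; n, theta)

print(binom.pmf(range(n + 1), n, theta).sum())  # the pmf sums to 1 over k = 0..n
```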

Poisson distribution

Typically used to model counts. The domain ($\mathcal{X}$) is $\{0, 1, 2, \ldots\}$. It has one parameter, $\lambda$.

The probability mass function is given as: $P(X = k) = \frac{\lambda^ke^{-\lambda}}{k!}$.

The expected value or mean of $X$ is $\lambda$.

You can see that the Poisson is a bit unusual: the mean and the variance are both encoded in a single parameter (the variance of $X$ is also $\lambda$). The negative binomial distribution is very similar (with almost the same generative story), except that it allows the mean and the variance of the distribution to differ.
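A quick sketch of the mean-equals-variance property ($\lambda = 4$ is an arbitrary choice here):

```python
import numpy as np
from scipy.stats import poisson

lam = 4.0
print(poisson.mean(lam), poisson.var(lam))   # both equal lambda = 4.0

rng = np.random.default_rng(0)
samples = rng.poisson(lam, size=100_000)
print(samples.mean(), samples.var())         # both approximately 4 in a large sample
```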

Hypergeometric

Continuous random variables

A continuous random variable can take an infinite number of possible values. Several interesting distributions exist:

Gaussian distribution

One of the most popular distributions is the Gaussian distribution. This distribution is defined for any number of variables. For the single-variable case, the distribution is defined using two parameters: $\mu$ and $\sigma$. The mean $\mu$ can take any value, and the standard deviation $\sigma$ must be $\ge 0$.

For a continuous distribution, you cannot compute the probability mass at any value of the random variable. Instead, you can compute the density using the probability density function: $$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp[-\frac{1}{2}(\frac{x - \mu}{\sigma})^2]$$ The random variable represented using a Gaussian distribution can take any value from $-\infty$ to $\infty$.
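Here is a minimal check of the density formula against scipy (illustrative values of $\mu$ and $\sigma$):

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0
x = 0.5

by_hand = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
print(by_hand, norm.pdf(x, loc=mu, scale=sigma))   # the two values agree
```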

Real-world example

Consider heights and weights of a population sample

Multiple Random Variables (Multivariate models)

A fundamental component of multivariate statistics is an $n$-dimensional random vector, which is just a vector of random variables. It maps from a sample space $S$ into $\mathcal{R}^n$.

Let's simplify this, though, and just assume we have two random variables, $X$ and $Y$.

For a super neat visualization of conditional probability tables, check out the first section of this website.

Joint PMF

If $X$ and $Y$ are discrete, the joint probability mass function (joint pmf) is defined as $f(x,y) = P(X=x,Y=y)$. This completely defines the probability distribution of the random vector $(X,Y)$.

Joint PDF

A function $f(x,y)$ is called a joint probability density function (joint pdf) if $X$ and $Y$ are continuous and for every $A \subset \mathcal{R}^2$:

$$P((X,Y) \in A) = \iint_A f(x,y)\,dx\,dy$$

Expectations of multiple random variables

Expectations of (functions of) two random variables are computed in the same ways as for single random variables.

In the discrete case: $$\mathbb{E}g(X,Y) = \sum_{(x,y)\in \mathcal{R}^2} g(x,y) f(x,y)$$.

The marginal distribution

Again, assume that we are working only with the random vector $(X, Y)$. The marginal distribution of $X$ in this case is given as:

$$f_X(x) = \sum_{y \in \mathcal{R}} f_{X,Y}(x,y)$$

You marginalize over all values of $Y$, and then get a (well-specified) probability distribution over $X$.

Note that the above is for the discrete case; the continuous case follows in the expected fashion, with the sum replaced by an integral.
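A minimal sketch of marginalization over a (made-up) joint pmf table:

```python
import numpy as np

# Joint pmf f(x, y) over x in {0, 1} (rows) and y in {0, 1, 2} (columns);
# the numbers are illustrative, not from the notes.
f = np.array([[0.10, 0.20, 0.10],
              [0.30, 0.20, 0.10]])

f_X = f.sum(axis=1)      # marginal of X: sum over all values of y
f_Y = f.sum(axis=0)      # marginal of Y: sum over all values of x
print(f_X, f_X.sum())    # [0.4 0.6], sums to 1
print(f_Y)               # [0.4 0.4 0.2]
```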

Conditional distributions

The conditional probability $P(Y|X)$ is (as we saw above): $P(Y=y \mid X=x) = \frac{P(X=x, Y=y)}{P(X=x)}$.

The conditional distribution of $Y$ given $X=x$ follows similarly:

$$f(y|x) = P(Y=y|X=x) = \frac{f(x,y)}{f_X(x)}$$
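Continuing the same made-up joint table from above, the conditional distribution divides each row of the joint pmf by the corresponding marginal:

```python
import numpy as np

f = np.array([[0.10, 0.20, 0.10],
              [0.30, 0.20, 0.10]])
f_X = f.sum(axis=1)

f_Y_given_X = f / f_X[:, None]    # row x holds f(y | x) = f(x, y) / f_X(x)
print(f_Y_given_X)
print(f_Y_given_X.sum(axis=1))    # each conditional distribution sums to 1
```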

Independence of variables

Two variables are independent if for all $x \in \mathcal{R}$ and $y \in \mathcal{R}$, $f(x,y) = f_X(x) f_Y(y)$. You can compute the probability of the two observations independently, and then just multiply them together!

Covariance and Correlation

When random variables are not independent, we want to know the strength of the relationship between them. One measure of that is the covariance, defined as:

$$\mathrm{Cov}(X,Y) = \mathbb{E}((X-\mu_X)(Y-\mu_Y))$$

where $\mu_X = \mathbb{E}(X)$ and $\mu_Y = \mathbb{E}(Y)$.

The correlation between $X$ and $Y$ is defined by:

$$\rho_{XY} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} $$

where $\sigma_X = \sqrt{\mathrm{Var}(X)}$, and analogously for $Y$.
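A small simulation sketch (the dependence between the two variables is constructed by hand, just to have something correlated):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 0.9 * x + rng.normal(scale=0.5, size=10_000)   # Y depends on X by construction

print(np.cov(x, y)[0, 1])        # sample covariance, roughly 0.9
print(np.corrcoef(x, y)[0, 1])   # sample correlation, roughly 0.87
```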

Exercise: What is an example of two quantities for which you would expect to see $\rho_{XY} = 0$? What about $\rho_{XY} = .9$? What about $\rho_{XY} = -.9$?

A reminder for later... correlation is not causation, yes, but also, be careful even with correlation!

Some multivariate probability distributions

Multinoulli distribution

A multinoulli distribution is a generalization of the Bernoulli distribution for trials that can take more than two possible values ($k > 2$). The parameter for the multinoulli distribution is a vector ${\bf \theta}$ with $k$ entries that sum to one. Each entry $\theta_i$ indicates the probability of observing category $i$ in a single trial.

Multinomial distribution

A multinomial distribution is a generalization of the binomial distribution for trials that can take more than two possible values. The parameters of the multinomial distribution are a vector ${\bf \theta}$ and the number of trials $n$.
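A quick sketch of drawing counts from a multinomial with numpy (the category probabilities are arbitrary):

```python
import numpy as np

theta = [0.2, 0.5, 0.3]                # one probability per category, sums to one
n = 100

rng = np.random.default_rng(0)
counts = rng.multinomial(n, theta)     # counts per category across n trials
print(counts, counts.sum())            # the counts sum to n
# with n = 1, a single draw is a multinoulli (categorical) outcome
```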

Multi-dimensional or multivariate Gaussian distribution

A distribution can be defined for multivariate random variables. One example is the multivariate Gaussian. In general, the random variable is a $D$-length vector ${\bf X}$. The two parameters of this distribution are a mean vector ${\bf \mu}$ and a covariance matrix ${\bf \Sigma}$. The pdf at any value ${\bf x}$ is given by: $$ \mathcal{N}({\bf x}| {\bf \mu,\Sigma}) \triangleq \frac{1}{(2\pi)^{D/2}|{\bf \Sigma}|^{1/2}}\exp\left[-\frac{1}{2}{\bf (x-\mu)^\top\Sigma^{-1}(x-\mu)}\right] $$ Note that if $D = 1$, it reduces to a univariate Gaussian distribution.
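A minimal sketch evaluating this density with scipy for $D = 2$ (the mean and covariance are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])

x = np.array([0.5, 0.5])
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))   # N(x | mu, Sigma)
```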